---
title: DataLoaders
keywords: fastai
sidebar: home_sidebar
summary: "Preprocessing and loading data for DL models."
description: "Preprocessing and loading data for DL models."
nb_path: "nbs/02_dataset.ipynb"
---
This module provides functions to build the data for a classification and a segmentation problem.
With the DataBlock API of fast.ai we can easily build the training (`train` attribute) and validation (`valid` attribute) splits inside a fastai `Datasets` object.
With a `DataLoaders` we can load the data onto the GPU while applying the transforms. It calls the PyTorch `DataLoader` on each subset of the `Datasets`.
The transforms (`tfms`) are applied in three stages:

- `after_item`: applied on each item after grabbing it inside the dataset.
- `before_batch`: applied on the list of items before they are collated.
- `after_batch`: applied on the batch as a whole after its construction.
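As a rough sketch of how these three stages compose (plain Python with toy stand-in transforms, not the actual fastai internals, which operate on images and tensors):

```python
# Toy stand-ins for the three transform stages; they only illustrate
# the order in which fastai applies them.
def after_item(item):
    return item * 2                 # per item, right after it is grabbed

def before_batch(items):
    return [i + 1 for i in items]   # on the list of items, before collation

def after_batch(batch):
    return tuple(batch)             # on the collated batch as a whole

items = [after_item(i) for i in [1, 2, 3]]
batch = after_batch(before_batch(items))
print(batch)  # (3, 5, 7)
```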
b_tfms = aug_transforms(
    size=size if size else (256, 1600),
    max_warp=0.1,
    max_rotate=5.,
    max_lighting=0.1)
b_tfms
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock()),
    get_x=ColReader(0, pref=train_path),
    get_y=ColReader(1, label_delim=' '),
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    batch_tfms=b_tfms)
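To see what `ColReader(1, label_delim=' ')` produces, here is a hypothetical row in the layout `train_multi` is assumed to have (column 0 = image file name, column 1 = space-delimited class ids); `MultiCategoryBlock` then one-hot encodes the split labels against the vocab:

```python
# Hypothetical row: image name plus a space-delimited label string,
# as read by ColReader(1, label_delim=' ')
row = ["0002cc93b.jpg", "1 3"]
labels = row[1].split(" ")      # ['1', '3']

vocab = ["1", "2", "3", "4"]    # assumed class vocabulary
one_hot = [1.0 if c in labels else 0.0 for c in vocab]
print(one_hot)  # [1.0, 0.0, 1.0, 0.0]
```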
Now we can check that everything works by passing a source to the dblock: the summary walks through every step of the process described above.
dblock.summary(train_multi, show_batch=True)
Then we can create a `Datasets` object.
dsets = dblock.datasets(train_multi, verbose=True)
len(dsets.train), len(dsets.valid), type(dsets.valid)
dsets.vocab
t = dsets.train[-3]
t
t_decoded = dsets.decode(t)
t_decoded
To create the `DataLoaders`, a batch size is needed.
dloader = dblock.dataloaders(train_multi, bs=8, verbose=True)
type(dloader)
An example of the pipeline:
f = dloader.after_item
items = [f(dloader.create_item(i)) for i in range(4)]
items[0][0].shape, items[0][1]
batch = dloader.do_batch(items)  # applies before_batch, then collates
batch[0].shape, batch[1]
show_image_batch(batch, items=4, cols=2, figsize=(15,5))
batch_tfms = dloader.after_batch(batch)
show_image_batch(batch_tfms, items=4, cols=2, figsize=(15,5))
bs = 4
dls = get_classification_dls(bs)
dls.train.show_batch(figsize=(15, 3))
dls.valid.show_batch(figsize=(15, 3))
x, y = dls.train.one_batch()
x.shape, y.shape
To get a DataLoaders object for training segmentation fastai models, we need to:

- Create the masks (stored in `labels_dir` by the preprocessing module with the `create_mask` function).
- Build a `Dataset` for each split and combine them into a `Datasets` object.
- Wrap each split in a `DataLoader` with a batch size and pair them in a `DataLoaders` object.

btfms = aug_transforms(
    size=size,
    max_warp=0.1,
    max_rotate=5.,
    max_lighting=0.1)
btfms
def get_x(s):
    img_name = s["ImageId"]
    return train_path / str(img_name)

def get_y(s):
    img_name = s["ImageId"].split(".")[0] + "_P.png"
    return labels_path / img_name
dblock = DataBlock(
    blocks=(ImageBlock, MaskBlock(codes=classes)),
    get_x=get_x,
    get_y=get_y,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    batch_tfms=btfms)
dblock.summary(train_multi, show_batch=True)
dsets = dblock.datasets(train_multi, verbose=True)
t = dsets.train[-1]
t
t_decoded = dsets.decode(t)
t_decoded
dloader = dblock.dataloaders(train_multi, verbose=True, bs=4)
type(dloader)
xb, yb = next(iter(dloader.train))
xb.shape, yb.shape
An example of the pipeline:
elems = [13, 17, 25]
n = len(elems)
figszs = (40,20)
imgszs = (25,5)
f = dloader.after_item
items = [f(dloader.create_item(i)) for i in elems]
items[0][0].shape, items[0][1].shape, items[0][0].max(), items[0][0].min()
img_batch, mask_batch = dloader.do_batch(items)  # applies before_batch, then collates
img_batch.shape, mask_batch.shape
fig, axs = plt.subplots(n, 1, figsize=figszs)
for i in range(len(axs)):
    show_image(img_batch[i], ax=axs[i], figsize=imgszs)

fig, axs = plt.subplots(n, 1, figsize=figszs)
for i in range(len(axs)):
    show_image(mask_batch[i], ax=axs[i], figsize=imgszs)
img_tfbatch, mask_tfbatch = dloader.after_batch((img_batch, mask_batch))
img_tfbatch.shape, mask_tfbatch.shape
fig, axs = plt.subplots(n, 1, figsize=figszs)
for i in range(len(axs)):
    show_image(img_tfbatch[i], ax=axs[i], figsize=imgszs)

fig, axs = plt.subplots(n, 1, figsize=figszs)
for i in range(len(axs)):
    show_image(mask_tfbatch[i], ax=axs[i], figsize=imgszs)
for img in img_tfbatch:
    print(img.min(), img.max())

for mask in mask_batch:
    print(mask.min(), mask.max())
The `get_segmentation_dls` function loads all the images from the folder, while `get_segmentation_dls_from_df` loads the images listed in a custom DataFrame, so we can train on a different subsample.
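A minimal sketch of building such a subsample, assuming `train_multi` has an `ImageId` column and a space-delimited label column (the `ClassIds` name and the toy values are illustrative):

```python
import pandas as pd

# Toy frame mimicking the assumed train_multi layout
df = pd.DataFrame({
    "ImageId": ["a.jpg", "b.jpg", "c.jpg"],
    "ClassIds": ["1", "3 4", "2"],
})

# Keep only images containing class 3, e.g. to focus on a rare defect type
subset = df[df["ClassIds"].str.split().apply(lambda c: "3" in c)]
subset = subset.reset_index(drop=True)
print(len(subset))  # 1
```

The resulting frame can then be passed to `get_segmentation_dls_from_df` in place of the full one.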
bs = 4
szs = (128, 800)
dls = get_segmentation_dls_from_df(train_multi, bs, szs)
dls.train.show_batch(figsize=(15, 3))
dls.valid.show_batch(figsize=(15, 3))
x, y = dls.train.one_batch()
x.shape, y.shape
[torch.unique(y[i]) for i in range(bs)]
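The `torch.unique` check above lists which class ids actually appear in each mask of the batch. On a toy batch of integer masks it behaves like this:

```python
import torch

# Toy batch of two 2x2 masks with an integer class id per pixel
y = torch.tensor([
    [[0, 0], [1, 1]],
    [[0, 2], [2, 2]],
])
# torch.unique returns the sorted distinct values per mask
present = [torch.unique(y[i]).tolist() for i in range(len(y))]
print(present)  # [[0, 1], [0, 2]]
```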
get_tsfms = get_transforms('train', *imagenet_stats)
get_tsfms
steel_ds = SteelDataset(train_pivot, path, *imagenet_stats, 'train')
x,y = steel_ds[0]
show_image(x)
print(x.min(), x.max(), y.shape)
test_dataset = TestDataset(train_path, train_multi, *imagenet_stats)
img_name, img_tensor = test_dataset[0]
print(img_name)
show_image(img_tensor)
steel_dls = get_train_dls(phase='train')
test_eq(len(steel_dls), 1257)
xb, yb = next(iter(steel_dls))
xb.shape, yb.shape
test_dls = get_test_dls(path / 'sample_submission.csv')
test_eq(len(test_dls), 1377)
xb, yb = next(iter(test_dls))
xb[0], yb[0].shape